Gradient-based optimization of kernel hyperparameters

Prior work

  1. Discovering and Exploiting Additive Structure for Bayesian Optimization
    1. No discussion of how the kernel parameters are optimized
    2. No code available from which to glean the method of optimization
  2. Batched High-dimensional Bayesian Optimization via Structural Kernel Learning
    1. No discussion of the optimization of kernel parameters
      1. From the experiment code, it seems that the hyperparameters are fixed and not learnt
    2. Does provide a facility to learn kernel parameters:
      1. $P(\eta|z)$, then $P(z|\eta)$, using zeroth-order optimization

Prior method

  1. For each of dim_n_iter iterations:
    1. For every dimension d:
      1. For every (ls[d], var[d]) pair drawn uniformly within some distance of $D_0 = \{ls[d]_0, var[d]_0\}$
      2. Evaluate the full likelihood for every possible pair.
      3. Set the best (ls[d], var[d]) as the dimensional parameters for d (a minimal sketch follows this list).
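
A minimal sketch of this coordinate-wise search, assuming a `log_likelihood(ls, var)` callable that evaluates the full GP marginal likelihood; the candidate count, sampling radius, and clipping below are illustrative choices rather than the original implementation:

```python
import itertools
import numpy as np

def coordinate_search(log_likelihood, ls, var, dim_n_iter, n_candidates=8, radius=0.5):
    """Coordinate-wise search over (ls[d], var[d]) pairs, one dimension at a time."""
    d = len(ls)
    for _ in range(dim_n_iter):
        for j in range(d):
            # Draw candidate values uniformly around the current (ls[j], var[j]),
            # clipped away from zero to keep the kernel parameters valid.
            cand_ls = np.maximum(1e-2, np.random.uniform(ls[j] - radius, ls[j] + radius, n_candidates))
            cand_var = np.maximum(1e-2, np.random.uniform(var[j] - radius, var[j] + radius, n_candidates))
            best_ll, best = -np.inf, (ls[j], var[j])
            # Evaluate the full likelihood for every possible (ls, var) pair.
            for c_ls, c_var in itertools.product(cand_ls, cand_var):
                ls[j], var[j] = c_ls, c_var
                ll = log_likelihood(ls, var)  # full GP likelihood, O(n^3)
                if ll > best_ll:
                    best_ll, best = ll, (c_ls, c_var)
            # Keep the best pair as the dimensional parameters for dimension j.
            ls[j], var[j] = best
    return ls, var
```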

The reason why the prior method works well: the optimization itself is quite weak, since the method simply performs hill climbing over one (ls, var) pair at a time. This allows the graph learning to take precedence, which in most instances lets the solution converge to a much better value. In other words, the effect of graph learning on regret matters far more than the kernel parameters, which is why the additive model is effective.

Performance Issue

There is also another issue: the performance characteristics are poor. The search requires $O(d\times\text{dim_n_iter}\times\text{param_budget})$ likelihood evaluations, and even though this is linear in the number of dimensions $d$, it takes a lot of time because each likelihood evaluation (GP inference) costs $O(n^3)$. The full complexity is:

$$ O(d\times\text{dim_n_iter}\times\text{param_budget}\times n^3) $$

This becomes a problem as the number of iterations (and hence $n$) grows large.
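
For a rough sense of scale, with purely illustrative values $d=10$, dim_n_iter $=5$, param_budget $=100$, and $n=200$ observations, a single parameter-learning pass already costs on the order of

$$ 10 \times 5 \times 100 \times 200^3 = 4\times10^{10} $$

basic operations, before accounting for the rest of the BO loop.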

Improved method using analytical gradients

The gradients with respect to the lengthscales are trivial, as they follow the standard derivation for the RBF-ARD kernel.

Due to our definition of the kernel scale parameter for a sub-function with variable group $G$: $$ \sigma_G = \sqrt{\sum_{i\in G} \sigma_i^2} $$

We know that:

$$ \DeclareMathOperator{\Tr}{Tr} \frac{dL}{d\sigma_j} = \frac{1}{2}\Tr\Bigg[(\alpha\alpha^T-K^{-1})\sum_{G\in\mathsf{G}} \frac{dk_G}{d\sigma_G}\frac{d\sigma_G}{d\sigma_j} \Bigg] $$

where, for the sub-kernel of group $G$:

$$ \frac{dk_G}{d\sigma_G} = \frac{k_G}{\sigma_G} $$

and, from the definition of $\sigma_G$ above:

$$ \frac{d\sigma_G}{d\sigma_j} = \frac{d}{d\sigma_j}\Bigg(\sqrt{\sum_{i\in G} \sigma_i^2}\Bigg) = \frac{\sigma_j}{\sigma_G} \quad\text{(and } 0 \text{ if } j\notin G\text{)} $$
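
A sketch of this gradient, written directly from the formula above; `K_groups` (the Gram matrices of the sub-kernels $k_G$), `groups` (the index sets $G$), and `sigma` (the per-dimension scales) are assumed placeholders for whatever the actual model exposes:

```python
import numpy as np

def scale_gradient(j, y, K, K_groups, groups, sigma):
    """dL/dsigma_j for the additive kernel, following the trace formula above."""
    K_inv = np.linalg.inv(K)                # in practice, use a Cholesky solve
    alpha = K_inv @ y
    inner = np.outer(alpha, alpha) - K_inv  # (alpha alpha^T - K^{-1})
    dK = np.zeros_like(K)
    for G, K_G in zip(groups, K_groups):
        if j not in G:
            continue                        # dsigma_G/dsigma_j = 0 when j is not in G
        sigma_G = np.sqrt(sum(sigma[i] ** 2 for i in G))
        # dk_G/dsigma_G * dsigma_G/dsigma_j
        dK += (K_G / sigma_G) * (sigma[j] / sigma_G)
    return 0.5 * np.trace(inner @ dK)
```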

Thereafter, we can simply optimize the likelihood $L$ using the gradients $\frac{dL}{dp}$, where $p=[l_0, var_0, \cdots]$, with some reasonable bounds on the lengthscales $(1e-2, 1e5)$ and the variances $(0.31622776601683794, 1e5)$ (the lower bound is $\sqrt{0.1}$). Basically, any gradient-based optimizer that supports bound constraints can be used (a minimal SciPy sketch follows below):

  1. Using L-BFGS-B
  2. Using TNC
  3. Using SLSQP

Note that SLSQP was not tried as I think it is too simple and would not work well in higher dimensions.
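
A minimal sketch with `scipy.optimize.minimize`; the objective below is a runnable placeholder standing in for the GP negative log marginal likelihood and its analytical gradient, and the dimensionality is illustrative:

```python
import numpy as np
from scipy.optimize import minimize

d = 10                                      # illustrative dimensionality
p0 = np.ones(2 * d)                         # stacked parameters [l_0, ..., var_0, ...]
bounds = [(1e-2, 1e5)] * d + [(0.31622776601683794, 1e5)] * d

def neg_log_likelihood_and_grad(p):
    # Placeholder: a smooth objective so the sketch runs end to end; the real
    # implementation returns (-L, -dL/dp) using the gradients derived above.
    val = np.sum((np.log(p) - 1.0) ** 2)
    grad = 2.0 * (np.log(p) - 1.0) / p
    return val, grad

result = minimize(neg_log_likelihood_and_grad, p0,
                  jac=True,                 # objective returns (value, gradient)
                  method="L-BFGS-B",        # or "TNC"
                  bounds=bounds)
ls, var = result.x[:d], result.x[d:]
```

`jac=True` tells SciPy that the objective returns both the value and the gradient, so the analytical gradients are used instead of finite differences.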

Using both optimizers with default settings

We observe an increase in the time required to run the experiments. This is to be expected, as the number of optimizer iterations is higher than in the prior method.

Performance was all over the map in these settings (see the 27 May 2020 entry for the L-BFGS-B run): some runs were equivalent in regret and some were worse. A closer analysis shows that the model has problems learning a better graph structure.

The core issue is that the optimized kernel parameters bias the model too strongly and too early towards a particular graph structure, severely hindering the learning of the graph structure and leaving it stuck in local optima.

Solution

The solution is to slowly increase the accuracy of the kernel-parameter optimization: in early iterations the kernel parameters should be only loosely optimized, and in later iterations optimized more and more tightly.

I have tried this with both L-BFGS-B and TNC:

  • L-BFGS-B: decay factr from 1e18 to 1e8; this roughly corresponds to the accuracy of the solution.
  • TNC: grow maxfun (the maximum number of function evaluations) from 1 up to dim × dim_n_iter × 2.
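
A sketch of both schedules, using the legacy `scipy.optimize.fmin_l_bfgs_b` / `fmin_tnc` interfaces (which expose `factr` and `maxfun` directly); the placeholder objective and the interpolation shapes are illustrative, not the exact settings used:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b, fmin_tnc

d, dim_n_iter, n_total, t = 10, 10, 50, 5   # illustrative sizes and current BO iteration
p0 = np.ones(2 * d)
bounds = [(1e-2, 1e5)] * d + [(0.31622776601683794, 1e5)] * d

def neg_log_likelihood_and_grad(p):
    # Placeholder standing in for the GP negative log marginal likelihood and gradient.
    val = np.sum((np.log(p) - 1.0) ** 2)
    grad = 2.0 * (np.log(p) - 1.0) / p
    return val, grad

# L-BFGS-B: decay factr geometrically from 1e18 (loose) to 1e8 (tight) over the run.
factr = 10 ** np.interp(t, [0, n_total - 1], [18, 8])
p_lbfgs, _, _ = fmin_l_bfgs_b(neg_log_likelihood_and_grad, p0,
                              bounds=bounds, factr=factr)

# TNC: grow the evaluation budget maxfun linearly from 1 up to d * dim_n_iter * 2.
maxfun = int(np.interp(t, [0, n_total - 1], [1, d * dim_n_iter * 2]))
p_tnc, _, _ = fmin_tnc(neg_log_likelihood_and_grad, p0,
                       bounds=bounds, maxfun=maxfun, messages=0)
```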

I will discuss a little bit more when we meet!
